In this project, we use univariate, bivariate, and multivariate analysis to discover patterns among the variables in the red wine data and to identify the chemical properties that influence the quality of red wine.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The output above shows that the dataset has 1599 rows and 13 variables. Two variables, X and quality, are of type int; X is an identifier that carries a unique value for each observation. All remaining variables are of type num.
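A minimal sketch of how output like the above can be produced with `names()` and `str()`. The data frame is rebuilt by hand from the first ten values of four columns in the `str()` listing, so it is only an excerpt of the 1599-row dataset; the names `wine_head`, `col_types`, and `wine_vars` are hypothetical, and dropping X is a suggestion rather than a step shown in the source.

```r
# First ten rows of four columns, copied from the str() output above.
# `wine_head` is a hypothetical name for this hand-built excerpt.
wine_head <- data.frame(
  X                = 1:10,
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

names(wine_head)  # column names
str(wine_head)    # dimensions, types, and leading values

# Storage class of each column: X and quality are integer, the rest numeric.
col_types <- sapply(wine_head, class)

# X is only a row identifier, so it can be dropped before analysis (a suggestion).
wine_vars <- wine_head[, names(wine_head) != "X"]
```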
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
The output above gives the minimum, 1st quartile, median, mean, 3rd quartile, and maximum of each variable in the dataset.
For the univariate analysis, we plot each variable individually.
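The closing summary mentions plotting a histogram for every variable. A minimal sketch using base R's `hist()` follows; the plotting library behind the original figures is not shown, so base graphics is a stand-in, and the input is only the first ten alcohol values from the `str()` output. The bin edges are chosen for this small excerpt and are an assumption.

```r
# First ten alcohol values from the str() output above (not the full dataset).
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Draw a histogram with base graphics; hist() also returns the bin counts.
h <- hist(alcohol,
          breaks = seq(9, 11, by = 0.5),  # assumed bin edges for this excerpt
          main   = "Alcohol (first ten samples)",
          xlab   = "alcohol")

h$counts  # number of samples falling in each bin
```

The same call can be repeated for each numeric column to get one plot per variable.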
Volatile acidity lies mostly between about 0.12 and 1.37, with a larger value near 1.6.
Fixed acidity lies between about 4.5 and 15, with a few larger values near 16.
Citric acid lies between 0 and about 0.81, with a few larger values near 1. The distribution also contains a large number of zeros.
Residual sugar lies between about 0.7 and 9.2, with a few larger values near 13 and 16.
Chlorides lie mostly between about 0.01 and 0.28, with a few larger values near 0.4 and 0.6.
Free sulfur dioxide (free.sulfur.dioxide) lies between 0 and about 45, with larger values near 50 and 70.
Total sulfur dioxide (total.sulfur.dioxide) lies between about 10 and 160, with larger values near 300.
Density lies between about 0.990 and 1.000.
pH lies between about 2.75 and 3.75, with a few larger values near 4.0.
Sulphates lie between about 0.25 and 1.4, with a few larger values near 1.6 and 2.0.
Alcohol lies between about 8.2 and 14, with a few larger values near 15.
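The "larger values" noted above can be made concrete with a common rule of thumb, Tukey's 1.5 × IQR fence. This rule is not used in the original analysis; it is swapped in here as one possible way to flag unusually high values, using only the quartiles and maxima reported in the `summary()` output. The name `wine_stats` is hypothetical.

```r
# Quartiles and maxima copied from the summary() output above.
wine_stats <- data.frame(
  variable = c("residual.sugar", "chlorides", "alcohol"),
  q1       = c(1.9, 0.07, 9.5),
  q3       = c(2.6, 0.09, 11.1),
  max      = c(15.5, 0.611, 14.9)
)

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as potential outliers.
wine_stats$iqr         <- wine_stats$q3 - wine_stats$q1
wine_stats$upper_fence <- wine_stats$q3 + 1.5 * wine_stats$iqr

# TRUE means the maximum lies beyond the fence, so at least one value is flagged.
wine_stats$max_flagged <- wine_stats$max > wine_stats$upper_fence
wine_stats
```

For residual sugar, for example, the fence is 2.6 + 1.5 × 0.7 = 3.65, far below the maximum of 15.5.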
Interestingly, the distribution of quality shows the following pattern:
- Most samples have a quality of 5 or 6.
- Fewer samples have a quality of 7 or 8.
- The fewest samples have a quality of 3 or 4.
The dataset has 1599 rows and 13 variables.
Since we want to identify the chemical properties that influence the quality of red wine, the main feature of interest is quality.
I think alcohol and the acidity-related properties citric.acid, volatile.acidity, and fixed.acidity will help our investigation, because these properties may influence the taste of the wine.
No, I did not create any new variables in this section.
Quality is concentrated at 5 and 6, whereas values of 3 and 4 are rare.
Box plots of each variable were drawn against quality. Among these plots, the clearest relationships are between quality and alcohol, volatile.acidity, citric.acid, and sulphates.
## poor good ideal
## 63 1319 217
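The three-level table above looks like the result of binning quality into a rating. A minimal sketch with `cut()` follows. The break points (3 to 4 as poor, 5 to 6 as good, 7 to 8 as ideal) are an assumption, since the original code is not shown; the name `rating` follows the later reference to faceting by rating. The input is the first ten quality values from the `str()` output, so the counts differ from the full-data table.

```r
# First ten quality scores from the str() output (not the full dataset).
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed bins: (0, 4] -> poor, (4, 6] -> good, (6, 10] -> ideal.
rating <- cut(quality,
              breaks = c(0, 4, 6, 10),
              labels = c("poor", "good", "ideal"))

# Count the samples in each rating level.
rating_counts <- table(rating)
rating_counts
```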
The correlation between quality and each of the remaining variables is computed below, to find the variables most strongly correlated with quality.
##
## Pearson's product-moment correlation
##
## data: red_wine$fixed.acidity and as.numeric(red_wine$quality)
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.07548957 0.17202667
## sample estimates:
## cor
## 0.1240516
##
## Pearson's product-moment correlation
##
## data: red_wine$volatile.acidity and as.numeric(red_wine$quality)
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
##
## Pearson's product-moment correlation
##
## data: red_wine$citric.acid and as.numeric(red_wine$quality)
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
##
## Pearson's product-moment correlation
##
## data: red_wine$residual.sugar and as.numeric(red_wine$quality)
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
##
## Pearson's product-moment correlation
##
## data: red_wine$chlorides and as.numeric(red_wine$quality)
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.17681041 -0.08039344
## sample estimates:
## cor
## -0.1289066
##
## Pearson's product-moment correlation
##
## data: red_wine$free.sulfur.dioxide and as.numeric(red_wine$quality)
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.099430290 -0.001638987
## sample estimates:
## cor
## -0.05065606
##
## Pearson's product-moment correlation
##
## data: red_wine$total.sulfur.dioxide and as.numeric(red_wine$quality)
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
##
## Pearson's product-moment correlation
##
## data: red_wine$density and as.numeric(red_wine$quality)
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
##
## Pearson's product-moment correlation
##
## data: red_wine$pH and as.numeric(red_wine$quality)
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
##
## Pearson's product-moment correlation
##
## data: log10(red_wine$sulphates) and as.numeric(red_wine$quality)
## t = 12.967, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2636092 0.3523323
## sample estimates:
## cor
## 0.3086419
##
## Pearson's product-moment correlation
##
## data: red_wine$alcohol and as.numeric(red_wine$quality)
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
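Each block above reports the correlation r, a t statistic, and a 95 percent confidence interval, and the latter two follow from r and the sample size. The sketch below recomputes them for alcohol from the reported r = 0.4761663 and df = 1597, using the standard t statistic for a Pearson correlation, t = r · sqrt(df / (1 − r²)), and the Fisher z-transformation for the interval. The variable names are hypothetical.

```r
# Values copied from the alcohol block of the cor.test() output above.
r  <- 0.4761663
df <- 1597
n  <- df + 2   # for a Pearson test df = n - 2, so n = 1599

# t statistic for testing that the true correlation is zero.
t_stat <- r * sqrt(df / (1 - r^2))

# 95% confidence interval via the Fisher z-transformation.
z    <- atanh(r)
half <- qnorm(0.975) / sqrt(n - 3)
ci   <- tanh(c(z - half, z + half))

round(t_stat, 3)
round(ci, 4)
```

The results agree with the reported t = 21.639 and the interval 0.4373540 to 0.5132081 up to rounding.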
Since quality is our main feature of interest, the correlation tests and scatter plot matrix above show that the variables most strongly correlated with quality are alcohol (r = 0.48), volatile.acidity (r = -0.39), sulphates (r = 0.31 on a log10 scale), and citric.acid (r = 0.23).
The plot above, of quality against alcohol, shows a positive relationship between them.
The plot above, of quality against volatile acidity, shows a negative relationship between them.
The plot above, of quality against citric acid, shows a positive relationship between them.
The plot above, of quality against sulphates, shows a positive relationship between them.
From the correlation tests and the scatter plot matrix, I found several relationships between the variables and quality. The positively correlated variables include alcohol, sulphates, citric acid, and fixed acidity. The negatively correlated variables include volatile acidity, density, and chlorides.
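The grouping above can be checked directly against the reported coefficients. The sketch below copies the eleven correlation estimates from the `cor.test()` output, ranks them by absolute value, and splits them by sign; the names `r_quality`, `ranked`, `positive`, and `negative` are hypothetical.

```r
# Pearson correlations with quality, copied from the cor.test() output.
# Sulphates was tested on a log10 scale in the source.
r_quality <- c(
  fixed.acidity        =  0.1240516,
  volatile.acidity     = -0.3905578,
  citric.acid          =  0.2263725,
  residual.sugar       =  0.01373164,
  chlorides            = -0.1289066,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.1851003,
  density              = -0.1749192,
  pH                   = -0.05773139,
  sulphates            =  0.3086419,
  alcohol              =  0.4761663
)

# Order by strength (absolute value), strongest first.
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]
round(ranked, 2)

# Split by sign to list positively and negatively correlated variables.
positive <- names(ranked[ranked > 0])
negative <- names(ranked[ranked < 0])
```

Even the strongest of these, alcohol at about 0.48, is a moderate rather than a strong correlation.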
In the scatter plot matrix, I observed an interesting relationship between chlorides and residual sugar. The scatter plot of chlorides against residual sugar is shown below.
The strongest relationship I found is between pH and fixed.acidity.
The plot above shows a negative relationship between citric acid and volatile acidity.
The plot above shows a positive relationship between citric acid and log10(sulphates).
The plot above shows a positive relationship between citric acid and alcohol.
The plot above shows a positive relationship between volatile acidity and log10(sulphates).
The plot above shows a positive relationship between volatile acidity and alcohol.
The plot above shows a positive relationship between sulphates and alcohol.
After faceting by rating, the plots suggest that a good wine has higher citric acid and lower volatile acidity. It also has larger amounts of alcohol and sulphates.
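One way to put numbers behind a comparison across rating levels is to compute a per-group median. The sketch below is an illustration of the mechanics only: it uses the first ten rows from the `str()` output, which are far too few to reproduce the conclusion above, and the rating bins (3 to 4 poor, 5 to 6 good, 7 to 8 ideal) are an assumption because the original code is not shown. The names `wine_head` and `medians` are hypothetical.

```r
# First ten rows of three columns from the str() output (not the full dataset).
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Assumed rating bins: 3-4 poor, 5-6 good, 7-8 ideal.
wine_head$rating <- cut(wine_head$quality,
                        breaks = c(0, 4, 6, 10),
                        labels = c("poor", "good", "ideal"))

# Median of each chemical property within each rating level.
# Levels with no rows (here "poor") are dropped from the result.
medians <- aggregate(cbind(volatile.acidity, alcohol) ~ rating,
                     data = wine_head, FUN = median)
medians
```

On the full dataset the same call would return one row per rating level, which can be read alongside the faceted plots.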
This plot shows the distribution of wine samples by quality. Most samples have a quality of 5 or 6, whereas only a few are below 5 or above 6.
Surprisingly, the plot above shows that pH has almost no relationship with wine quality (r = -0.06 in the correlation test).
The plot above shows that good wines contain larger amounts of citric acid, and that lower volatile acidity is associated with better wine.
The dataset contains 1599 samples of red wine.
By performing EDA, I was able to find patterns in the data, such as the features that influence the quality of wine.
First, I went through each variable individually, plotting a histogram for every variable in the dataset to understand its distribution.
Then, from the ggpairs scatter plot matrix and the correlation tests, I found four variables that are most strongly correlated with the quality of wine, and I plotted each of them against quality in a bivariate plot.
Finally, I drew multivariate plots of these variables to understand more deeply how they influence quality. I found that good-quality wine tends to have a large amount of citric acid and a small amount of volatile acidity, as well as higher alcohol content and more sulphates.
I am not very familiar with chemistry and do not know much about these chemicals, which may have limited my insights into the data.
For future work, I think a dataset with a wider range of variables would help, for example to examine whether the sweetness of a wine affects its quality.